[Bug][RayJob] Sidecar mode shouldn't restart head pod when head pod is deleted #4234
Conversation
cc @Future-Outlier @rueian PTAL
Future-Outlier left a comment
Hi, @400Ping
let us know if you need any help, plz ping us
Ok, will do, thanks.
```go
	return len(pods.Items)
}, TestTimeoutMedium, 2*time.Second).Should(Equal(1))
g.Consistently(RayJob(test, rayJob.Namespace, rayJob.Name), TestTimeoutShort).
	ShouldNot(WithTransform(RayJobDeploymentStatus, Equal(rayv1.JobDeploymentStatusFailed)))
```
I think when the head Pod is deleted in the default K8sJobMode, we will get into the following block and the job deployment status will still be Failed. I'm not sure whether this is the expected behavior.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 302 to 307 in 2464704
```go
if finishedAt != nil {
	rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
	rayJobInstance.Status.Reason = rayv1.AppFailed
	rayJobInstance.Status.Message = "Submitter completed but Ray job not found in RayCluster."
	break
}
```
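If failing in that case is indeed the intended behavior for K8sJobMode, a small hedged e2e sketch (reusing the Gomega helpers already quoted above, such as RayJob and RayJobDeploymentStatus; not part of this PR) could pin it down:

```go
// Hypothetical assertion mirroring the quoted test helpers: in default
// K8sJobMode, after the head Pod is deleted and the submitter has finished,
// the RayJob would be expected to eventually report JobDeploymentStatusFailed.
g.Eventually(RayJob(test, rayJob.Namespace, rayJob.Name), TestTimeoutMedium).
	Should(WithTransform(RayJobDeploymentStatus, Equal(rayv1.JobDeploymentStatusFailed)))
```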
I think I now understand why this logic is written this way. We currently have two scenarios:
- The RayJob loses its head pod before the ray job is submitted.
- The RayJob loses its head pod after the ray job has already been submitted.
For Scenario 1:
When the head Pod is deleted while the submitter Pod is attempting to submit the ray job, the submitter encounters a WebSocket connection failure and exits with a non-zero exit code. The K8s Job then creates a new submitter Pod to retry (the submitterBackoffLimit default is 2).
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 663 to 665 in dea5baa
```go
func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *rayv1.RayJob, submitterTemplate corev1.PodTemplateSpec) error {
	logger := ctrl.LoggerFrom(ctx)
	submitterBackoffLimit := ptr.To[int32](2)
```
Meanwhile, the RayCluster controller recreates the head Pod and worker Pods (only sidecar mode skips the head Pod restart).
If the new head Pod becomes ready before the K8s Job reaches its backoffLimit, the new submitter Pod could successfully submit the ray job, and the RayJob could eventually transition to JobDeploymentStatusComplete with JobStatus = JobStatusSucceeded. But that requires good luck.
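To illustrate the retry budget in Scenario 1, here is a minimal sketch (not the actual KubeRay code; the helper name newSubmitterJob is made up for illustration) of a submitter Job built with client-go types and the same backoffLimit of 2:

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// newSubmitterJob builds a submitter Job whose Pods are retried only a few
// times; if every attempt fails before the recreated head Pod becomes ready,
// the Job is marked Failed and the RayJob cannot recover on its own.
func newSubmitterJob(name, namespace string, template corev1.PodTemplateSpec) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			// Same default as submitterBackoffLimit in createNewK8sJob.
			BackoffLimit: ptr.To[int32](2),
			Template:     template,
		},
	}
}
```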
For Scenario 2:
When the head Pod is deleted after the ray job has already been submitted, the K8s Job is marked as Completed because the submitter Pod exits with exit code 0, even though the ray job itself has not completed.
Then, when the controller tries to query the ray job status via GetJobInfo, it returns BadRequest (job not found), and the RayJob transitions to JobDeploymentStatusFailed with JobStatus = JobStatusFailed. (This was fixed by #3860.)
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 289 to 306 in dea5baa
```go
jobInfo, err := rayDashboardClient.GetJobInfo(ctx, rayJobInstance.Status.JobId)
if err != nil {
	// If the Ray job was not found, GetJobInfo returns a BadRequest error.
	if errors.IsBadRequest(err) {
		if rayJobInstance.Spec.SubmissionMode == rayv1.HTTPMode {
			logger.Info("The Ray job was not found. Submit a Ray job via an HTTP request.", "JobId", rayJobInstance.Status.JobId)
			if _, err := rayDashboardClient.SubmitJob(ctx, rayJobInstance); err != nil {
				logger.Error(err, "Failed to submit the Ray job", "JobId", rayJobInstance.Status.JobId)
				return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
			}
			return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
		}
		// finishedAt will only be set if submitter finished
		if finishedAt != nil {
			rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
			rayJobInstance.Status.Reason = rayv1.AppFailed
			rayJobInstance.Status.Message = "Submitter completed but Ray job not found in RayCluster."
			break
```
If the error is not BadRequest, it instead goes to JobDeploymentStatusTransitionGracePeriodExceeded.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 1138 to 1161 in dea5baa
```go
func checkSubmitterFinishedTimeoutAndUpdateStatusIfNeeded(ctx context.Context, rayJob *rayv1.RayJob, finishedAt *time.Time) bool {
	logger := ctrl.LoggerFrom(ctx)
	// Check if timeout is configured and submitter has finished
	if finishedAt == nil {
		return false
	}
	// Check if timeout has been exceeded
	if time.Now().Before(finishedAt.Add(DefaultSubmitterFinishedTimeout)) {
		return false
	}
	logger.Info("The RayJob has passed the submitterFinishedTimeoutSeconds. Transition the status to terminal.",
		"SubmitterFinishedTime", finishedAt,
		"SubmitterFinishedTimeoutSeconds", DefaultSubmitterFinishedTimeout.String())
	rayJob.Status.JobStatus = rayv1.JobStatusFailed
	rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
	rayJob.Status.Reason = rayv1.JobDeploymentStatusTransitionGracePeriodExceeded
	rayJob.Status.Message = fmt.Sprintf("The RayJob submitter finished at %v but the ray job did not reach terminal state within %v",
		finishedAt.Format(time.DateTime), DefaultSubmitterFinishedTimeout)
	return true
}
```
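To summarize the two paths above, here is a hypothetical helper (not the actual controller code; the name classifyMissingJob and its string results are made up for illustration) condensing the decision logic discussed in this thread:

```go
package example

import "time"

// classifyMissingJob condenses the two outcomes discussed above when
// GetJobInfo cannot find the submitted Ray job.
func classifyMissingJob(isBadRequest bool, finishedAt *time.Time, now time.Time, graceTimeout time.Duration) string {
	if isBadRequest && finishedAt != nil {
		// Scenario 2: the submitter exited 0 but the job is gone from the RayCluster.
		return "JobDeploymentStatusFailed (AppFailed)"
	}
	if finishedAt != nil && !now.Before(finishedAt.Add(graceTimeout)) {
		// The submitter finished but the job never reached a terminal state in time.
		return "JobDeploymentStatusFailed (TransitionGracePeriodExceeded)"
	}
	// Otherwise keep requeueing and polling the dashboard.
	return "requeue"
}
```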
As Kai-Hsun mentioned in Slack, I think we can simply remove this line, since the behavior in Scenario 1 is not reliable.
I think it is pretty rare, and typically it will hit the backoffLimit before the new Pod is ready, so the whole cluster needs to restart. There is a lot of potential unexpected behavior here; users should not rely on it.
Future-Outlier left a comment
Hi @400Ping, do you mind resolving the comments?
Done, PTAL
win5923 left a comment
LGTM
Why are these changes needed?
In this PR, we fail the RayJob if the RayCluster has already been provisioned. However, applying this to other modes like K8sJobMode or HTTPMode would be a breaking change, so those modes are excluded.
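A minimal sketch of how that scoping could look (a hypothetical helper, not the actual diff in this PR; it assumes a rayv1.SidecarMode constant):

```go
package example

import rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"

// shouldFailOnHeadPodDeletion is a hypothetical guard: fail the RayJob only
// when the RayCluster is already provisioned and the submission mode is
// sidecar, so K8sJobMode and HTTPMode keep their current behavior.
func shouldFailOnHeadPodDeletion(rayJob *rayv1.RayJob, clusterProvisioned bool) bool {
	return clusterProvisioned && rayJob.Spec.SubmissionMode == rayv1.SidecarMode
}
```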
Related issue number
Closes #4176